Improving Credit Scoring by Generalized Additive Model
Authors
Abstract
Logistic Regression has been widely used in the financial service industry for credit scoring models. Despite its advantages of easy interpretation and low computing cost, Logistic Regression is criticized for failing to model the nonlinear effects of predictors on the dependent variable, which might lead to unsatisfactory results. Modern statistical techniques such as Neural Network and Projection Pursuit Regression have proven successful in nonlinear modeling. However, this success comes at the price of interpretability. Introduced by Hastie and Tibshirani, Generalized Additive Model provides the ability to detect nonlinear patterns without sacrificing interpretability. The purpose of this paper is to introduce Generalized Additive Model as a promising alternative to Logistic Regression for credit scoring. The modeling strategies for both Logistic Regression and Generalized Additive Model with SAS/STAT are investigated to guide readers from Logistic Regression through Generalized Additive Model. A comparison of the predictive performance of the two modeling techniques shows the superiority of the Generalized Additive Model. We hope that statistical analysts will be able to implement Generalized Additive Model with SAS/STAT in the real business world by following the practical guidance outlined in this paper.
INTRODUCTION
Assessing credit risk has become crucial for credit card companies and commercial banks due to the explosive growth of the credit market in recent years. Statistical models, such as Linear Discriminant Analysis, Logistic Regression, Classification and Regression Tree, and Neural Network, are widely used to evaluate the creditworthiness of potential borrowers in order to reduce the default risk (Franke, Hardle, and Stahl 2000, Shao 2004).
Logistic Regression, which is a special case of Generalized Linear Models (McCullagh and Nelder 1989), is the most widely used statistical model in the credit scoring industry. Introduced by McCullagh and Nelder, Generalized Linear Models provide a unified framework to model a response from any member of the exponential family of distributions, such as Gaussian, Binomial, or Poisson. In Generalized Linear Model, the dependent variable Y is related to the linear combination of predictors B1*X1+...+Bm*Xm in the form G(E(Y|X)) = G(u) = B0+B1*X1+...+Bm*Xm, where u is the mean of the dependent variable Y and G(.) is a monotonic differentiable function known as the Link Function. For Logistic Regression, Generalized Linear Model can be expressed as Logit(p) = Log(p/(1-p)) = B0+B1*X1+...+Bm*Xm, where u is p = Prob(Y=1|X) and G(.) is the Logit(.) function.
In SAS/STAT, both the LOGISTIC and GENMOD procedures can be used to build Logistic Regression. While the LOGISTIC procedure is specifically designed for Logistic Regression and provides a variety of features for model selection and diagnosis, the GENMOD procedure can be used to fit a wide range of Generalized Linear Models with a response from any exponential family distribution. Considering the popularity of Logistic Regression in current business applications, we will use it as a benchmark model and compare it with Generalized Additive Model.
In Generalized Linear Model, the relationship between the transformed mean response G(E(Y|X)) and the predictors is assumed to be linear. A potential risk of this assumption is model misspecification. While the effects of predictors are often nonlinear in the real world, it is always challenging to find an appropriate functional form for the partial effect of each predictor on the response variable. As a result, Generalized Linear Model might not always provide an appropriate fit for data with complex structure.
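The logit link described above can be sketched in a few lines. The example below uses Python rather than SAS purely for illustration, and the coefficient values and the applicant's predictor values are hypothetical, not estimates from the paper. It shows how a linear predictor B0+B1*X1+...+Bm*Xm is mapped to p = Prob(Y=1|X) through the inverse of Logit(.).

```python
import math


def logit(p):
    """Link function G(p) = Log(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Inverse link: map a linear predictor eta back to a probability."""
    # Branch on the sign of eta so that exp() never overflows.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def linear_predictor(b0, coefs, x):
    """Compute B0 + B1*X1 + ... + Bm*Xm."""
    return b0 + sum(b * xi for b, xi in zip(coefs, x))


# Hypothetical intercept, coefficients, and predictor values (assumptions).
b0 = -2.0
coefs = [0.8, -0.5]
applicant = [1.5, 2.0]

eta = linear_predictor(b0, coefs, applicant)
p = inv_logit(eta)
print(f"linear predictor = {eta:.4f}, Prob(Y=1|X) = {p:.4f}")
```

Because Logit(.) is monotonic, a one-unit increase in Xi always shifts Logit(p) by Bi, which is the source of the easy interpretation the text attributes to Logistic Regression.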
Proposed by Hastie and Tibshirani, Generalized Additive Model (Hastie and Tibshirani 1990) relaxes the linearity assumption of Generalized Linear Model and assumes that the dependent variable Y depends on univariate smooth terms of the predictors rather than on the predictors themselves. Therefore, the functional form of Generalized Linear Model G(E(Y|X)) = B0+B1*X1+...+Bm*Xm can be extended to G(E(Y|X)) = B0+S1(X1)+...+Sm(Xm), where Bi*Xi is replaced by a nonparametric smooth function Si(Xi) for predictor Xi. The function S(.) is estimated in a flexible manner. For nonlinear terms, any nonparametric smoothing method can be used. In SAS/STAT, the GAM procedure uses B-spline and local regression for univariate smoothing and thin-plate smoothing for bivariate smoothing. However, the function S(.) does not have to be nonlinear for all predictors in Generalized Additive Model. In practice, it is more common to mix linear terms with nonlinear ones, which gives a semi-parametric variation of Generalized Additive Model G(E(Y|X)) = B0+B1*X1+...+Bm*Xm+S_{m+1}(X_{m+1})+...+S_{m+n}(X_{m+n}). Since each individual effect is estimated using a univariate smoother, the Curse of Dimensionality is avoided. Moreover, the important feature of interpretability in Generalized Linear Model is retained in Generalized Additive Model. The function S(.) is the analogue of the coefficient in Generalized Linear Model. The estimate of Si(Xi) explains how the response changes along with the corresponding predictor Xi.
SAS Global Forum 2007 Data Mining and Predictive Modeling
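To make the additive form G(E(Y|X)) = B0+S1(X1)+...+Sm(Xm) concrete, the following Python sketch estimates each Si(.) one at a time by backfitting, the iterative scheme described by Hastie and Tibshirani. It is a simplified illustration under stated assumptions, not the algorithm of the SAS/STAT GAM procedure: it assumes a Gaussian response with an identity link, swaps in a crude running-mean smoother in place of splines or local regression, and uses synthetic data. All function and variable names are hypothetical.

```python
def running_mean_smoother(x, r, k=5):
    """Smooth values r against x by averaging the k nearest points in x order.

    This is a stand-in smoother (an assumption), chosen only for brevity.
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    fitted = [0.0] * n
    half = k // 2
    for pos, i in enumerate(order):
        lo = max(0, pos - half)
        hi = min(n, pos + half + 1)
        window = [r[order[j]] for j in range(lo, hi)]
        fitted[i] = sum(window) / len(window)
    return fitted


def backfit(X, y, n_iter=20, k=5):
    """Fit y ~ B0 + S1(X1) + ... + Sm(Xm); X is a list of predictor columns."""
    n = len(y)
    b0 = sum(y) / n
    S = [[0.0] * n for _ in X]
    for _ in range(n_iter):
        for j, xj in enumerate(X):
            # Partial residuals: remove the intercept and all other smooth terms.
            partial = [
                y[i] - b0 - sum(S[l][i] for l in range(len(X)) if l != j)
                for i in range(n)
            ]
            sj = running_mean_smoother(xj, partial, k)
            # Centre each smooth term so that the intercept stays identifiable.
            mean_sj = sum(sj) / n
            S[j] = [v - mean_sj for v in sj]
    return b0, S


# Synthetic data (an assumption): a quadratic effect of x1, a linear effect of x2.
n = 40
x1 = [-2.0 + 4.0 * i / (n - 1) for i in range(n)]
x2 = [((7 * i) % n) / n for i in range(n)]
y = [1.0 + a * a + 2.0 * b for a, b in zip(x1, x2)]

b0_hat, S_hat = backfit([x1, x2], y)
fitted = [b0_hat + S_hat[0][i] + S_hat[1][i] for i in range(n)]
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
tss = sum((yi - b0_hat) ** 2 for yi in y)
print(f"residual SS = {rss:.3f}, total SS = {tss:.3f}")
```

Plotting S_hat[0] against x1 and S_hat[1] against x2 would show the estimated partial effect of each predictor, which plays the role of the coefficient in a Generalized Linear Model. For a binary response such as default, the same idea needs a logit link and a reweighting step around the backfitting loop, which this sketch omits.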
Similar resources
A case study on using generalized additive models to fit credit rating scores
We consider the estimation of credit scores by means of semiparametric logit models. In credit scoring, the fitted rating score should not only provide an optimal classification result but also serve as a modular component of a (typically quite complex) rating system. This means in particular that a rating score should be given by a linearly weighted sum of rating factors. That way the rating p...
An Investigation into the Use of Generalized Additive Neural Networks in Credit Scoring
Logistic regression occupies a central position in the field of credit scoring as it is relatively well understood and an explicit formula can be derived on which credit decisions may be based. Although artificial neural networks may be more powerful than logistic regression, they are not widely used in credit scoring because they are black boxes with respect to interpretation and the absence of reas...
Bayesian Inference for Generalized Additive Regression based on Dynamic Models
We present a general approach for Bayesian inference via Markov chain Monte Carlo (MCMC) simulation in generalized additive, semiparametric and mixed models. It is particularly appropriate for discrete and other fundamentally non-Gaussian responses, where Gibbs sampling techniques developed for Gaussian models cannot be applied. We use the close relation between nonparametric regression and dynamic o...
Improving a Credit Scoring Model by Incorporating Bank Statement Derived Features
In this paper, we investigate the extent to which features derived from bank statements provided by loan applicants, and which are not declared on an application form, can enhance a credit scoring model for a New Zealand lending company. Exploring the potential of such information to improve credit scoring models in this manner has not been studied previously. We construct a baseline model base...
Using the Hybrid Model for Credit Scoring (Case Study: Credit Clients of microloans, Bank Refah-Kargeran of Zanjan, Iran)
In any country, commercial banks lay the groundwork for economic growth by collecting national resources and capitals and allocating them to different economic sectors. Optimal allocation of resources is especially important in achieving this goal. Banks with an effective and dynamic system of customer assessment can efficiently allocate their resources to customers regardless of their geograph...